## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.0 0.27 0.36 20.7 0.045
## 2 6.3 0.30 0.34 1.6 0.049
## 3 8.1 0.28 0.40 6.9 0.050
## 4 7.2 0.23 0.32 8.5 0.058
## 5 7.2 0.23 0.32 8.5 0.058
## 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
After loading the data, I found that it contains 12 variables and 4898 observations. All variables are numeric except ‘quality’, which is an integer.
To get a better understanding of each variable, I read about the fermentation procedure and wine quality, and found a few hints that I will base my analysis on:
1. The quality and taste of a wine are largely influenced by the balance between acids and sugar; unbalanced acidity and sugar compromise the taste.
2. Free sulfur dioxide is an effective antiseptic that influences the quality of wines, but a high amount of sulfur dioxide compromises the taste.
Additionally, I made some modifications to this dataset. The data type of ‘quality’ is integer, but it behaves like a categorical variable, so I converted it into a factor for convenient analysis. In addition, some quality levels have very few observations, so I reduced the number of groups (quality scores 3 and 4 = bad, 5 to 7 = average, 8 and 9 = good) for a more effective analysis.
table(ww$quality)
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
ww$quality.factor<- as.factor(ww$quality)
#create variable using condition method
cond <- ww$quality>=8
ww$quality.class<-ifelse(cond, 'good', 'average')
cond<-ww$quality<5
ww$quality.class <- ifelse(cond, 'bad', ww$quality.class)
table(ww$quality.class)
##
## average bad good
## 4535 183 180
ww$quality.class <- as.factor(ww$quality.class)
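As an alternative to the two `ifelse` steps above, the same three-way grouping can be sketched with base R's `cut()`. This is only an illustration on a made-up vector of scores (a stand-in for `ww$quality`, not the real data), and the break points are chosen to reproduce the bad/average/good rule described earlier.

```r
# Hypothetical toy scores standing in for ww$quality (not the real data).
quality <- c(3L, 4L, 5L, 6L, 7L, 8L, 9L)

# Intervals are right-closed by default:
# (2,4] -> bad, (4,7] -> average, (7,9] -> good.
quality_class <- cut(quality,
                     breaks = c(2, 4, 7, 9),
                     labels = c("bad", "average", "good"))

# cut() already returns a factor, so no as.factor() step is needed.
table(quality_class)
```

One difference from the `ifelse` version is that the factor levels come out in the order bad, average, good rather than alphabetically, which may be preferable for plots.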
Before looking into each variable, I’d like to get a sense of the entire dataset. Below is a matrix of plots made with the ggpairs and ggcorr functions.
In this section I will explore all variables to understand their nature, and create some new variables that I think are meaningful for the next analysis.
The plots indicate that quality is approximately normally distributed, although scores 3 and 9 have very few observations (20 and 5 wines). In the quality class plot, the majority of wines are labeled average, while bad and good wines have similar numbers of observations.
Density has a small number of outliers with very large values; after limiting the x axis, the distribution looks normal.
Chlorides is also affected by outliers: the original plot has a long tail on the right, while the majority of the data is normally distributed.
In a wine, free sulfur dioxide is the working component that prevents bacterial growth, and it is dynamically released from total sulfur dioxide, which in turn is released by sulphates. We can tell from the plots above that sulphates is the most abundant of the three and free.sulfur.dioxide the least. I speculated that the efficiency of free sulfur dioxide release is important for wine quality, so I created a variable “free.sulfur.percent” that divides ‘free.sulfur.dioxide’ by ‘total.sulfur.dioxide’ and represents the relative amount of free sulfur in the wine.
#create variable using vector operation
ww$free.sulfur.percent<-ww$free.sulfur.dioxide/ww$total.sulfur.dioxide
#comparison of 'free.sulfur.dioxide' and 'free.sulfur.percent'
p1=ggplot(ww,aes(x=free.sulfur.dioxide))+
geom_histogram(binwidth = 1)
p2=ggplot(ww,aes(x=free.sulfur.percent))+
geom_histogram(binwidth = 0.01)
grid.arrange(p1, p2)
summary(ww$free.sulfur.percent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02362 0.19090 0.25370 0.25560 0.31580 0.71050
The plot and summary show that the share of free sulfur dioxide ranges from 2% to 71%, with a median of 25%.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Alcohol is an essential component of a wine. In the dataset, the alcohol percentage ranges from 8 to 14.2, and its distribution looks approximately normal.
The residual sugar distribution is right-skewed, so I applied a log10 transformation; the transformed plot appears to be bimodal.
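The effect of a log10 transformation on a long right tail can be sketched as follows. The values here are synthetic (drawn from a log-normal distribution as a stand-in for `ww$residual.sugar`), so the numbers are illustrative only and do not reproduce the bimodal shape seen in the real data.

```r
set.seed(1)
# Synthetic right-skewed values standing in for ww$residual.sugar (assumption).
sugar <- rlnorm(1000, meanlog = 1.5, sdlog = 0.8)

# A mean well above the median is a quick sign of a long right tail.
c(mean = mean(sugar), median = median(sugar))

# Bin counts on the raw and on the log10 scale
# (plot = FALSE computes the histogram without drawing it).
raw_hist <- hist(sugar, breaks = 30, plot = FALSE)
log_hist <- hist(log10(sugar), breaks = 30, plot = FALSE)

# After the transformation the mean and median are much closer together.
c(mean = mean(log10(sugar)), median = median(log10(sugar)))
```

With ggplot2, the same idea is usually expressed by adding `scale_x_log10()` to the histogram instead of transforming the column.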
The three acid variables are all approximately normally distributed; we can tell from the x axes that fixed acids are much more abundant than the other two kinds.
Since pH is by definition closely related to the amount of acid, it is expected that the pH plot also has a normal shape.
Based on the plots of the sugar and acid variables above, I wondered what the acid/sugar balance looks like across this wine data. I created a variable “total.acid” that sums all three acids, because they all contribute to the acidity of the taste, and another variable “acid.sugar.ratio” that divides total acid by residual sugar, and plotted this variable.
#create variable using vector operation
ww$total.acid<-ww$fixed.acidity+ww$volatile.acidity+ww$citric.acid
ww$acid.sugar.ratio<-ww$total.acid/ww$residual.sugar
p8=ggplot(ww,aes(x=total.acid))+
geom_histogram(binwidth = 0.1)
grid.arrange(p5, p8)
#set x axis limits and breaks
ggplot(ww,aes(x=acid.sugar.ratio))+
geom_histogram(binwidth = 0.01)+
scale_x_continuous(limits = c(0, 10),
breaks = seq(0, 10, 1))
Because fixed acids are much more abundant than the other acids, the total acid distribution is essentially a reflection of fixed acidity. The bimodal distribution of acid.sugar.ratio suggests two distinct populations. I speculate that there are two wine categories, sweet and acidy: the sweet wines usually contain less than about 2 times as much acid as residual sugar, while the acidy wines usually contain more. Based on the shape of the plot, I set an acid.sugar.ratio of 2.3 as the cut-off and created the factor variable ‘category’ for later analysis.
#create variable using condition method and set as a factor
cond <- ww$acid.sugar.ratio<=2.3
ww$category<-ifelse(cond, 'sweet', 'acidy')
ww$category <- as.factor(ww$category)
table(ww$category)
##
## acidy sweet
## 1907 2991
ggplot(ww,aes(x=category))+
geom_bar()
The original data contains 12 variables and 4898 observations (wines); after I created some new variables for analysis, it now contains 17 variables.
The main features of interest are free sulfur dioxide and alcohol; based on my preliminary analysis, they are the most likely to influence the quality of wines.
Other features such as acid and residual sugar may contain some hidden information. On the other hand, many variables are associated with each other just by definition, such as pH and acidity, and those relationships will not be my main focus.
I created the new numerical variables “total.acid”, “acid.sugar.ratio” and “free.sulfur.percent” using vector operations. I also created two factor variables, “category” and “quality.class”, to group the wines based on their attributes.
A few features, such as “residual.sugar”, are right-skewed and bimodal. To handle this, I created the variable “acid.sugar.ratio” and assigned the wines to two categories for further analysis.
##
## Pearson's product-moment correlation
##
## data: total.acid and pH
## t = -33.388, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4531918 -0.4075605
## sample estimates:
## cor
## -0.4306513
pH and acidity are related to each other by definition; I plotted them to make sure this dataset doesn’t have a major flaw. The correlation is negative (r = -0.43), as expected.
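A Pearson test like the one reported above comes from `cor.test()`. The sketch below runs it on synthetic data in which pH falls as total acid rises; the vectors and their coefficients are assumptions made for illustration and are not taken from the wine data.

```r
set.seed(42)
# Synthetic stand-ins (assumption): pH decreases with total acid, plus noise.
total_acid <- rnorm(500, mean = 7.5, sd = 0.9)
pH <- 3.9 - 0.1 * total_acid + rnorm(500, sd = 0.15)

# cor.test() uses Pearson's product-moment correlation by default.
res <- cor.test(total_acid, pH)

res$estimate   # sample correlation
res$conf.int   # 95 percent confidence interval
res$p.value    # p-value for the null hypothesis of zero correlation
```

On the real data frame the equivalent call would presumably be `cor.test(ww$total.acid, ww$pH)`, which matches the `data:` line in the output above.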
In this plot, I looked at the distribution of free sulfur dioxide in each quality class (good, average, bad), using ‘free.sulfur.percent’. The distribution is similar between good and average wines (mostly 20%-40%), but bad wines have a relatively lower share of free sulfur dioxide (often less than 20%), probably because less free sulfur dioxide means a higher chance of bacterial contamination.
Frequency polygons more effectively represent the same point.
The plot indicates that more of the average and bad wines contain lower alcohol, while more of the good wines tend to contain higher alcohol.
From the plot matrix I noticed that density is highly correlated with both sugar and acid. Earlier I created the variables “acid.sugar.ratio” and “category”, and I’d like to dive deeper from this perspective.
Even though there are more sweet wines than acidy wines, both categories have a similar, roughly normal distribution of quality scores, which means that the categories are not biased with respect to quality.
The boxplot shows that the majority of sweet wines have a higher density than acidy wines. This is not a surprise, because the molecular weight of glucose is larger than that of the acids.
In the Univariate Plots section, I found that the majority of good wines contain more alcohol compared to average and bad wines. Here I used a boxplot to visualize this and statistics to determine its significance.
## ww$quality.class: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.40 10.30 10.48 11.30 14.20
## --------------------------------------------------------
## ww$quality.class: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.40 10.10 10.17 10.80 13.50
## --------------------------------------------------------
## ww$quality.class: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.65 12.60 14.00
## Analysis of Variance Table
##
## Response: alcohol
## Df Sum Sq Mean Sq F value Pr(>F)
## quality.class 2 258.3 129.174 88.338 < 2.2e-16 ***
## Residuals 4895 7157.8 1.462
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
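Per-class summaries and a variance table of this shape can be produced with `by()` and `anova()` on a linear model. The sketch below uses a small synthetic data frame (`toy`, a hypothetical stand-in for `ww` with made-up class means), so its numbers will not match the output above.

```r
set.seed(7)
# Synthetic stand-in for ww (assumption): three classes with different mean alcohol.
toy <- data.frame(
  quality.class = factor(rep(c("average", "bad", "good"), times = c(200, 50, 50))),
  alcohol = c(rnorm(200, 10.5, 1.2), rnorm(50, 10.2, 1.0), rnorm(50, 11.6, 1.2))
)

# Summary statistics of alcohol within each quality class.
by(toy$alcohol, toy$quality.class, summary)

# One-way ANOVA via a linear model; anova() returns the variance table.
fit <- lm(alcohol ~ quality.class, data = toy)
tab <- anova(fit)
tab
```

A significant F test only says that at least one class mean differs from the others; a follow-up such as `TukeyHSD(aov(alcohol ~ quality.class, data = toy))` would be needed to see which pairs of classes differ.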
To further investigate this point, I layered a scatterplot with boxplots to visualize individual observations and summaries.
##
## Pearson's product-moment correlation
##
## data: alcohol and quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4126015 0.4579941
## sample estimates:
## cor
## 0.4355747
The plot shows that from quality score 3 to 5 alcohol decreases slightly, and from score 5 to 9, which covers the majority of the wine data, alcohol increases with quality score.
## ww$quality.class: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02362 0.19320 0.25490 0.25690 0.31630 0.71050
## --------------------------------------------------------
## ww$quality.class: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03371 0.10540 0.16130 0.18880 0.23850 0.65680
## --------------------------------------------------------
## ww$quality.class: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.07895 0.22310 0.28770 0.28930 0.33620 0.60380
## Analysis of Variance Table
##
## Response: free.sulfur.percent
## Df Sum Sq Mean Sq F value Pr(>F)
## quality.class 2 1.028 0.51410 59.575 < 2.2e-16 ***
## Residuals 4895 42.241 0.00863
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In addition, free sulfur percent seems to be another interesting feature correlated with quality. I used the same plots to investigate this feature.
##
## Pearson's product-moment correlation
##
## data: free.sulfur.percent and quality
## t = 14.076, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1701474 0.2239834
## sample estimates:
## cor
## 0.1972141
As the results show, in high-quality wines a higher percentage of free sulfur dioxide is released from total sulfur dioxide (median 29% in good wines versus 16% in bad wines).
## ww$quality.class: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04589 0.05000 0.34600
## --------------------------------------------------------
## ww$quality.class: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01300 0.03750 0.04600 0.05056 0.05400 0.29000
## --------------------------------------------------------
## ww$quality.class: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01400 0.03000 0.03550 0.03801 0.04400 0.12100
## Analysis of Variance Table
##
## Response: chlorides
## Df Sum Sq Mean Sq F value Pr(>F)
## quality.class 2 0.01509 0.0075463 15.905 1.302e-07 ***
## Residuals 4895 2.32241 0.0004744
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Pearson's product-moment correlation
##
## data: chlorides and quality
## t = -15.024, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2365501 -0.1830039
## sample estimates:
## cor
## -0.2099344
Using similar statistics and plots, I found that high-quality wines contain less chlorides.
I looked at the distribution of ‘alcohol’, ‘free.sulfur.percent’, and ‘chlorides’ in wines of different quality scores. Alcohol and free sulfur percent are positively associated with quality score, while chlorides is negatively associated with it; all three associations are statistically significant.
I investigated the variable ‘acid.sugar.ratio’ that I created in the last section; it has a non-linear relationship with density.
Of the three features I tested, alcohol has the strongest association with wine quality (Pearson r = 0.44).
In previous sections I found that there are two wine categories based on their acid/sugar ratio, and that neither category is biased in its quality distribution. I also found some interesting features that are significantly associated with quality scores, such as alcohol, sulfur dioxide and chlorides. Here I’d like to further investigate how those features influence quality in the different categories.
##
## Pearson's product-moment correlation
##
## data: alcohol and quality
## t = 21.088, df = 1905, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3979196 0.4707314
## sample estimates:
## cor
## 0.4350364
##
## Pearson's product-moment correlation
##
## data: alcohol and quality
## t = 27.372, df = 2989, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4185567 0.4758859
## sample estimates:
## cor
## 0.4476812
This plot shows that acidy wines generally contain more alcohol than sweet wines. Both sweet and acidy wines exhibit a positive association between alcohol and quality (r = 0.45 and r = 0.44), which suggests that alcohol is important for both sweet and acidy wines.
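Running the same correlation test separately within each category can be sketched with `split()` and `lapply()`. The data frame `toy` below is synthetic (a hypothetical stand-in for `ww`, with an assumed positive link between alcohol and quality), so only the pattern of the code carries over, not the numbers.

```r
set.seed(3)
# Synthetic stand-in for ww (assumption): two categories, quality rises with alcohol.
toy <- data.frame(
  category = factor(rep(c("acidy", "sweet"), each = 150)),
  alcohol = rnorm(300, mean = 10.5, sd = 1.2)
)
toy$quality <- round(6 + 0.4 * (toy$alcohol - 10.5) + rnorm(300, sd = 0.8))

# split() gives one data frame per category; cor.test() runs on each.
tests <- lapply(split(toy, toy$category),
                function(d) cor.test(d$alcohol, d$quality))

# Collect the correlation estimate for each category.
sapply(tests, function(x) unname(x$estimate))
```

Each element of `tests` is a full `htest` object, so confidence intervals and p-values are available per category as well.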
##
## Pearson's product-moment correlation
##
## data: residual.sugar and quality
## t = 7.1028, df = 1905, p-value = 1.72e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1165746 0.2040375
## sample estimates:
## cor
## 0.1606214
##
## Pearson's product-moment correlation
##
## data: residual.sugar and quality
## t = -7.5156, df = 2989, p-value = 7.444e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1711912 -0.1008383
## sample estimates:
## cor
## -0.1361865
The plots and statistics above show that:
1. While the amounts of sugar differ greatly, the amounts of acid are almost the same in both categories, so the different flavors are mainly determined by the amount of sugar.
2. In acidy wines, sugar is slightly positively associated with quality (r = 0.16), and in sweet wines the association is reversed (r = -0.14). These results again reinforce the idea that the balance between acidity and sweetness is important for a wine.
In sweet wines, both free sulfur dioxide and total sulfur dioxide are higher than in acidy wines, especially in average-scored wines. This makes sense because an acidic environment itself prevents bacterial growth, so less exogenous sulfur dioxide is needed as a preservative. On the other hand, sulfur dioxide compromises the taste, so high-quality sweet wines may rely on other antiseptic techniques rather than just adding more sulfur dioxide.
Density is a feature that has linear-like relationships with multiple variables.
By definition, the density of a wine should be influenced by components such as alcohol, sugar, and salt. The plots above indicate a good linear relationship between density and alcohol, and between density and the residual.sugar/total.acid ratio, but not between density and chlorides.
##
## Calls:
## m1: lm(formula = density ~ alcohol, data = ww)
## m2: lm(formula = density ~ alcohol + residual.sugar + residual.sugar:total.acid,
## data = ww)
## m3: lm(formula = density ~ alcohol + residual.sugar + chlorides +
## residual.sugar:total.acid, data = ww)
##
## ===================================================================
## m1 m2 m3
## -------------------------------------------------------------------
## (Intercept) 1.014*** 1.005*** 1.004***
## (0.000) (0.000) (0.000)
## alcohol -0.002*** -0.001*** -0.001***
## (0.000) (0.000) (0.000)
## residual.sugar -0.000*** -0.000***
## (0.000) (0.000)
## residual.sugar x total.acid 0.000*** 0.000***
## (0.000) (0.000)
## chlorides 0.003***
## (0.001)
## -------------------------------------------------------------------
## R-squared 0.609 0.928 0.929
## adj. R-squared 0.609 0.928 0.929
## sigma 0.002 0.001 0.001
## F 7613.412 21176.390 15973.967
## p 0.000 0.000 0.000
## Log-likelihood 23815.906 27978.321 27991.876
## Deviance 0.017 0.003 0.003
## AIC -47625.812 -55946.642 -55971.751
## BIC -47606.322 -55914.159 -55932.772
## N 4898 4898 4898
## ===================================================================
##
## Call:
## lm(formula = density ~ alcohol, data = ww)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.005475 -0.001238 -0.000153 0.001156 0.047201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.014e+00 2.300e-04 4407.87 <2e-16 ***
## alcohol -1.896e-03 2.173e-05 -87.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001871 on 4896 degrees of freedom
## Multiple R-squared: 0.6086, Adjusted R-squared: 0.6085
## F-statistic: 7613 on 1 and 4896 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = density ~ alcohol + residual.sugar + residual.sugar:total.acid,
## data = ww)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0025036 -0.0005100 -0.0001178 0.0003840 0.0176851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.005e+00 1.181e-04 8503.23 <2e-16 ***
## alcohol -1.221e-03 1.041e-05 -117.29 <2e-16 ***
## residual.sugar -1.472e-04 1.340e-05 -10.98 <2e-16 ***
## residual.sugar:total.acid 6.595e-05 1.709e-06 38.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0008001 on 4894 degrees of freedom
## Multiple R-squared: 0.9285, Adjusted R-squared: 0.9284
## F-statistic: 2.118e+04 on 3 and 4894 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = density ~ alcohol + residual.sugar + chlorides +
## residual.sugar:total.acid, data = ww)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0024713 -0.0005101 -0.0001222 0.0003858 0.0175285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.004e+00 1.367e-04 7346.728 < 2e-16 ***
## alcohol -1.200e-03 1.113e-05 -107.842 < 2e-16 ***
## residual.sugar -1.446e-04 1.338e-05 -10.812 < 2e-16 ***
## chlorides 2.928e-03 5.618e-04 5.211 1.95e-07 ***
## residual.sugar:total.acid 6.577e-05 1.705e-06 38.569 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.000798 on 4893 degrees of freedom
## Multiple R-squared: 0.9289, Adjusted R-squared: 0.9288
## F-statistic: 1.597e+04 on 4 and 4893 DF, p-value: < 2.2e-16
The linear regression models show that ‘alcohol’ alone explains 60.9% of the variance in density (m1). Adding ‘residual.sugar’ and the ‘residual.sugar:total.acid’ term raises this to 92.8%, which makes m2 a decent model; note that in an R formula this term is the interaction (product) of the two variables, not their ratio. Adding ‘chlorides’ (m3) improves the fit by only 0.1 percentage points. The conclusion for this part is that the density of a wine is mainly influenced by alcohol percentage and by the interplay of sugar and acid.
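The comparison between m1 and m2 can be sketched with base R alone. The data frame `toy` is synthetic and its coefficients are made up (assumptions for illustration, not estimates from the wine data); the point is to show how the nested models are compared and what the `residual.sugar:total.acid` term actually contributes to the design matrix.

```r
set.seed(11)
n <- 400
# Synthetic stand-in for ww (assumption); all coefficients below are made up.
toy <- data.frame(
  alcohol = rnorm(n, 10.5, 1.2),
  residual.sugar = rexp(n, rate = 1 / 6),
  total.acid = rnorm(n, 7.5, 0.9)
)
toy$density <- 1.005 - 0.0012 * toy$alcohol +
  0.00005 * toy$residual.sugar * toy$total.acid +
  rnorm(n, sd = 0.0008)

m1 <- lm(density ~ alcohol, data = toy)
m2 <- lm(density ~ alcohol + residual.sugar + residual.sugar:total.acid,
         data = toy)

# In an R formula, a:b is the interaction (product) column, not a ratio.
colnames(model.matrix(m2))

# Nested models can be compared by R-squared and by an F test.
c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
anova(m1, m2)
```

If a ratio term were intended instead, it would have to be written explicitly, for example as `I(residual.sugar / total.acid)` in the formula.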
It is very interesting that I initially defined ‘sweet wines’ and ‘acidy wines’ based only on the bimodal shape of the acid.sugar.ratio histogram, and that as I investigated other features I found more evidence that they are two distinct wine groups. First, they have different densities and contain different amounts of alcohol. Second, residual sugar influences quality in opposite directions in the two kinds of wine. Last, the amounts of sulfur dioxide also differ between the two kinds. These findings also agree with the facts and common sense that I explained in this section.
It is surprising that when I investigated density, I found that it does not have a linear relationship with either acids or sugar alone, but has a good linear relationship with the residual.sugar/total.acid ratio. Using both variables together greatly improved the regression model.
I created a linear model of density. It is a strong model because it accounts for 92.9% of the variance in wine density. Its limitation in application is that it cannot produce a prediction if any of its input variables is missing from a given observation.
When I plotted residual sugar, I noticed that it is right-skewed and that after a log10 transformation the distribution appears bimodal, while the acids are normally distributed. This indicates that high or low sugar content may define two different wine flavors. Considering that the sugar/acid balance is important to wine flavor, I created the variable ‘acid.sugar.ratio’; its histogram indicates that this variable can also separate two kinds of wine with distinct flavors.
The plot shows the distribution of alcohol for each quality score and quality class. It demonstrates the positive association between alcohol and quality, which is one of the main findings of this analysis.
This plot is very interesting. First, it again indicates two distinct populations: there is heavy overplotting on the right side, which suggests that the Sugar Acid Ratio of that population has a very weak influence on density, whereas the other population shows a linear relationship between density and Sugar Acid Ratio. Second, the plot shows that alcohol influences density almost equally across the entire population.